@mergify mergify bot commented Oct 9, 2025

What does this PR do?

This PR improves error handling for Elasticsearch output configurations in the Hybrid Elastic Agent by:

  1. Partially moving ownership of configuration translation: Relocates some of the Elasticsearch output translation logic from the beats library (libbeat/otelbeat/oteltranslate/outputs/elasticsearch) into the elastic-agent package (internal/pkg/otel/translate/output_elasticsearch.go). In the future we should complete the transition to the elastic-agent repo, as this gives elastic-agent full control over the translation.

  2. Enabling graceful error handling: Adds continue_on_error: true to the beatsauth extension configuration in getBeatsAuthExtensionConfig(). This prevents the OpenTelemetry collector from exiting on startup when it encounters an invalid SSL configuration (e.g., missing certificate files); see the respective PR.
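As a rough illustration of the second point, the sketch below shows one way a translated output configuration could be copied into an extension configuration with continue_on_error enabled. The helper name beatsAuthExtensionConfig, its signature, and the map-based representation are assumptions made for this example; this is not the actual getBeatsAuthExtensionConfig() implementation from the PR.

```go
package main

import "fmt"

// beatsAuthExtensionConfig is a hypothetical helper (not the PR's actual code).
// It copies the translated output settings into a new map and sets
// continue_on_error so that an invalid setting does not abort startup.
func beatsAuthExtensionConfig(outputCfg map[string]any) map[string]any {
	extCfg := make(map[string]any, len(outputCfg)+1)
	for k, v := range outputCfg {
		extCfg[k] = v
	}
	// The key name comes from the PR description; how the extension
	// interprets it is outside the scope of this sketch.
	extCfg["continue_on_error"] = true
	return extCfg
}

func main() {
	cfg := beatsAuthExtensionConfig(map[string]any{
		"ssl.certificate": "/etc/client/cert.pem",
		"ssl.key":         "/etc/client/cert.key",
	})
	fmt.Println("continue_on_error:", cfg["continue_on_error"])
}
```

Copying into a fresh map keeps the caller's configuration untouched, which makes the helper safe to call for several outputs in a row.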

Why is it important?

When an Elasticsearch output has invalid configuration (like a missing SSL certificate), the collector exits with a vague error message that doesn't identify which output caused the failure:

error found during service initialization: failed to build extensions: failed to create extension "beatsauth": failed unpacking config: open /etc/client/cert.pem: no such file or directory

Benefits of this PR:

  • The collector stays running instead of exiting at startup, allowing other pipelines that use different exporters with valid configuration to continue pushing data
  • Errors are surfaced at the exporter level when requests are made, making it clear which output failed

Screenshot shows the intended behavior: collector continues running and errors are properly surfaced at the exporter level.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive user impact expected.

How to test this PR locally

Build and install elastic-agent from this branch with the following configuration:

  id: agent-pernode-debug
  outputs:
    default:
      hosts:
      - ${ES_HOST}
      password: ${ES_PASSWORD}
      type: elasticsearch
      username: ${ES_USERNAME}
      # invalid ssl settings
      ssl.certificate: /etc/client/cert.pem
      ssl.enabled: true
      ssl.key: /etc/client/cert.key
      ssl.key_passphrase: null
      ssl.key_passphrase_path: null
      ssl.verification_mode: none
    system:
      hosts:
      - ${ES_HOST}
      password: ${ES_PASSWORD}
      type: elasticsearch
      username: ${ES_USERNAME}
  secret_references: []
  agent:
    monitoring:
      # enable otel collector for self-monitoring
      _runtime_experimental: otel
      enabled: true
      logs: true
      metrics: true
      namespace: default
      use_output: default
  inputs:
    - data_stream:
        namespace: default
      id: system-logs
      streams:
      - data_stream:
          dataset: system.auth
          type: logs
        exclude_files:
        - \.gz$
        ignore_older: 72h
        multiline:
          match: after
          pattern: ^\s
        paths:
        - /var/log/auth.log*
        - /var/log/secure*
        processors:
        - add_locale: null
        tags:
        - system-auth
      - data_stream:
          dataset: system.syslog
          type: logs
        exclude_files:
        - \.gz$
        ignore_older: 72h
        multiline:
          match: after
          pattern: ^\s
        paths:
        - /var/log/messages*
        - /var/log/syslog*
        - /var/log/system*
        processors:
        - add_locale: null
        tags: null
      type: logfile
      use_output: system
    - data_stream:
        namespace: default
      id: system-metrics
      streams:
      - cpu.metrics:
        - percentages
        - normalized_percentages
        data_stream:
          dataset: system.cpu
          type: metrics
        metricsets:
        - cpu
        period: 10s
      - data_stream:
          dataset: system.diskio
          type: metrics
        diskio.include_devices: null
        metricsets:
        - diskio
        period: 10s
      - data_stream:
          dataset: system.filesystem
          type: metrics
        metricsets:
        - filesystem
        period: 1m
        processors:
        - drop_event.when.regexp:
            system.filesystem.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
      - data_stream:
          dataset: system.fsstat
          type: metrics
        metricsets:
        - fsstat
        period: 1m
        processors:
        - drop_event.when.regexp:
            system.fsstat.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
      - condition: ${host.platform} != 'windows'
        data_stream:
          dataset: system.load
          type: metrics
        metricsets:
        - load
        period: 10s
      - data_stream:
          dataset: system.memory
          type: metrics
        metricsets:
        - memory
        period: 10s
      - data_stream:
          dataset: system.network
          type: metrics
        metricsets:
        - network
        network.interfaces: null
        period: 10s
      - data_stream:
          dataset: system.process
          type: metrics
        metricsets:
        - process
        period: 10s
        process.cgroups.enabled: false
        process.cmdline.cache.enabled: true
        process.include_cpu_ticks: false
        process.include_top_n.by_cpu: 5
        process.include_top_n.by_memory: 5
        processes:
        - .*
      - data_stream:
          dataset: system.process_summary
          type: metrics
        metricsets:
        - process_summary
        period: 10s
      - data_stream:
          dataset: system.socket_summary
          type: metrics
        metricsets:
        - socket_summary
        period: 10s
      - data_stream:
          dataset: system.uptime
          type: metrics
        metricsets:
        - uptime
        period: 10s
      type: system/metrics
      use_output: system

Related issues

N/A


This is an automatic backport of pull request #10343 done by [Mergify](https://mergify.com).

@mergify mergify bot requested a review from a team as a code owner October 9, 2025 08:42
@mergify mergify bot requested review from michalpristas and straistaru and removed request for a team October 9, 2025 08:42
@mergify mergify bot added the backport and conflicts labels Oct 9, 2025

mergify bot commented Oct 9, 2025

Cherry-pick of 0c0dada has failed:

On branch mergify/bp/8.19/pr-10343
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit 0c0dada00.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   internal/pkg/otel/translate/otelconfig_test.go
	new file:   internal/pkg/otel/translate/output_elasticsearch.go
	new file:   internal/pkg/otel/translate/output_elasticsearch_test.go
	modified:   testing/integration/ess/otel_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   internal/pkg/otel/translate/otelconfig.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@github-actions github-actions bot added the Team:Elastic-Agent-Control-Plane and skip-changelog labels Oct 9, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)


mergify bot commented Oct 13, 2025

This pull request has not been merged yet. Could you please review and merge it @pkoutsovasilis? 🙏

* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

# Conflicts:
#	internal/pkg/otel/translate/otelconfig.go
@pkoutsovasilis pkoutsovasilis force-pushed the mergify/bp/8.19/pr-10343 branch from a307675 to 78ba0f1 Compare October 14, 2025 08:31
@pkoutsovasilis pkoutsovasilis removed the conflicts label Oct 14, 2025
@ebeahan

ebeahan commented Oct 17, 2025

@pkoutsovasilis is this backport waiting on some other dependencies to merge into 8.19 first?
